
Add transaction logs and basic conflict resolution #403

Merged: 10 commits into main, Nov 24, 2024

Conversation

paraseba (Contributor)

No description provided.

@paraseba force-pushed the push-kmsppxuloqls branch 4 times, most recently from ef8ebfb to cb287a1, on November 23, 2024 at 03:42
@rabernat (Contributor) left a comment:

Seba, this looks amazing. I gave this what I would call a "shallow" review. Overall the approach makes sense, the data structures look right, the tests are legible, etc.

The following observation is in no way a blocker for moving forward with this PR...

I wonder about the overall design of the transaction logs. Right now the logs are useful for detecting conflicts only, but they don't have the full information from the changeset. I'm curious whether it would make sense to just serialize the full changeset as the transaction log. This is what lance does: https://lancedb.github.io/lance/format.html

The advantage of this is that you could potentially recover an entire session if a commit fails. (Currently this information just lives in memory.) The downside of course is that the changeset is much larger and more expensive to store / parse.

Comment on lines +121 to +122
#[allow(clippy::panic)]
VersionSelection::Fail => panic!("Bug in conflict resolution: ChunkDoubleUpdate flagged as unrecoverable")
Contributor:

Are you saying we will never actually hit this code path in normal usage?

Contributor Author (paraseba):

Yes, this possibility is filtered out above, but the compiler doesn't know it.
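A minimal, self-contained sketch of the pattern under discussion, using a hypothetical Selection enum rather than icechunk's real VersionSelection: an earlier pass is expected to filter out the Fail variant, but the compiler can't prove that, so the arm panics and the clippy lint is silenced.

// Hypothetical stand-in for the real VersionSelection; illustration only.
enum Selection {
    UseOurs,
    UseTheirs,
    Fail,
}

// Callers are expected to have handled `Fail` before reaching this function,
// so the last arm should be unreachable in practice.
fn apply(selection: &Selection) -> &'static str {
    match selection {
        Selection::UseOurs => "keep our version of the chunk",
        Selection::UseTheirs => "keep the incoming version of the chunk",
        #[allow(clippy::panic)]
        Selection::Fail => panic!("bug in conflict resolution: Fail should have been filtered out"),
    }
}

fn main() {
    assert_eq!(apply(&Selection::UseOurs), "keep our version of the chunk");
}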

Comment on lines +763 to +764
for snap_id in new_commits.into_iter().rev() {
let tx_log = self.storage.fetch_transaction_log(&snap_id).await?;
Contributor:

One thought about this loop which is probably a very premature optimization...

Rather than fetching each commit one at a time and computing the conflicts one at a time, couldn't we fetch all of the transaction logs (in one async gather), merge them together, and compute conflicts once?

This optimization could be important if there are lots of commits.
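A rough sketch of that suggestion, with hypothetical stand-in types instead of icechunk's real Storage API and assuming the futures crate is available, so this shows the shape of the change rather than the final code: gather all transaction logs concurrently, then merge or inspect them in one pass.

use futures::future::try_join_all;

// Hypothetical stand-ins for the real types; illustration only.
struct SnapshotId(String);
struct TransactionLog;

async fn fetch_transaction_log(_id: &SnapshotId) -> std::io::Result<TransactionLog> {
    // In the real code this would read from object storage.
    Ok(TransactionLog)
}

// One concurrent gather instead of one round trip per commit.
async fn fetch_all_logs(new_commits: &[SnapshotId]) -> std::io::Result<Vec<TransactionLog>> {
    try_join_all(new_commits.iter().rev().map(fetch_transaction_log)).await
}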

Contributor Author (paraseba):

Great question. That's what I had initially, and it's what sounded more intuitive to me too. Here is why I gave up on it for this first pass; we can always go back eventually:

  • The logic of merging the transaction logs is not trivial; in fact, it looks a lot like the conflict resolution logic itself. Example: one commit deletes an array, the next commit creates a group on the same path and edits its attributes, and so on. It gets very messy, and it will get much worse once we introduce rename.
  • In the current design I was careful to ensure that if you are able to rebase only half of the commits, you get back a repo rebased to that point; you can then change your policy (or make some manual changes) and continue with another rebase. If we merge all the tx logs up front, it's all or nothing.
  • Memory usage is much larger if you have to merge many tx logs.
  • Error reporting becomes intractable under tx log merge.

I should have written a design document with all these details, but I was too bored of this problem already, hehe.

As I mentioned somewhere, I don't feel I got to the best design for this problem. I'm pretty sure we'll have to revisit it soon.

@@ -725,6 +734,63 @@ impl Repository {
}
}

pub async fn rebase(
Contributor:

A comment about what exactly this important function does would be very helpful.

My interpretation is that, if successful, it mutates the state of the repo such that it can commit successfully to the specified branch.

Contributor Author (paraseba):

Yes, your interpretation is correct. I'll document it and include the tricky case of "partial success", which is important.
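To make the "partial success" case concrete, here is a toy model of the behaviour described above (hypothetical names and types, not icechunk's API): commits are rebased one at a time, and an unsolvable conflict stops the loop without undoing the commits that were already applied.

// Toy model of partial success during rebase; illustration only.
struct Session {
    // Remote commits this session has already been rebased on top of.
    rebased_onto: Vec<String>,
}

#[derive(Debug)]
enum RebaseError {
    UnsolvableConflict { commit: String },
}

fn rebase(
    session: &mut Session,
    new_commits: &[String],
    can_solve: impl Fn(&str) -> bool,
) -> Result<(), RebaseError> {
    for commit in new_commits {
        if !can_solve(commit.as_str()) {
            // Stop here: everything rebased so far stays applied, so the caller can
            // switch to a different solver (or fix things by hand) and rebase again.
            return Err(RebaseError::UnsolvableConflict { commit: commit.clone() });
        }
        session.rebased_onto.push(commit.clone());
    }
    Ok(())
}

fn main() {
    let mut session = Session { rebased_onto: vec![] };
    let commits = vec!["c1".to_string(), "c2".to_string(), "c3".to_string()];
    // Pretend the conflict introduced by "c3" cannot be solved by the current policy.
    let result = rebase(&mut session, &commits, |c| c != "c3");
    assert!(result.is_err());
    assert_eq!(session.rebased_onto, vec!["c1", "c2"]); // partial progress is kept
}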

Contributor Author (paraseba):

done

pub async fn rebase(
&mut self,
solver: &dyn ConflictSolver,
update_branch_name: &str,
@rabernat (Contributor), Nov 23, 2024:

The default should be to update against the repo's current active branch. I can see always having to specify this explicitly as a potential pain point for users. 95% of the time there will just be a single branch active in a repo.

Edit: I guess this is just a general feature of the Rust API, so probably not a big deal.

Contributor Author (paraseba):

Yeah, this is the low-level API, and there is no notion of a current branch at this level, but we will definitely make it the default (or the only option) in the Python bindings.

repo2.rebase(&ConflictDetector, "main").await,
);
Ok(())
}
Contributor:

These tests are very clear and easy to follow.

repo2
.set_user_attributes(
path.clone(),
Some(UserAttributes::try_new(br#"{"foo":"bar"}"#).unwrap()),
Contributor:

Just noting that this conflict should actually be resolvable since the changes are the same! 😆 But I understand of course that sort of resolution is not implemented yet.

Contributor Author (paraseba):

Yes!!! This is a somewhat more advanced feature that I want us to have eventually. I wrote the test this way so that we get a test failure once we implement that.

}

#[tokio::test()]
async fn test_rebase_without_fast_forward() -> Result<(), Box<dyn Error>> {
Contributor:

It was hard for me to understand what this is testing just by reading the code. Consider adding some comments to this and the next test.

Contributor Author (paraseba):

Great, I'll add a brief comment to every test.

Contributor Author (paraseba):

done


Ok(())
}

Contributor:

General comment: coming from python, pytest, and the wonderful world of test parametrization, these tests feel very verbose and somewhat repetitive. Not sure if there is any remedy for that.

Contributor Author (paraseba):

Yeah, it's painful... people tend to use macros. But in general our tests are bad: too verbose, with no helper functions. I want us to refactor all of them into something much more readable and easier to write. We have #380 to track that work.

pub deleted_arrays: HashSet<NodeId>,
pub updated_user_attributes: HashSet<NodeId>,
pub updated_zarr_metadata: HashSet<NodeId>,
pub updated_chunks: HashMap<NodeId, HashSet<ChunkIndices>>,
Contributor:

Does updated chunks include deleted chunks?

More generally, how do we handle deleted chunks? Do we treat them the same as any other chunk write?

Contributor Author (paraseba):

Yes, it's included there. For conflict detection today we treat a delete as any other chunk write, which is not great: if both sessions delete the same chunk, it shouldn't be a conflict. But that's also easy to work around with the default UseOurs policy. We can revisit later unless you have an important use case I haven't thought about.
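A tiny illustration of the current behaviour described above, with simplified types rather than the real detector: because a chunk delete is recorded in the same per-node set as a chunk write, two sessions that delete the same chunk index still intersect and get flagged.

use std::collections::HashSet;

// Simplified stand-in for ChunkIndices; illustration only.
type ChunkIndex = Vec<u32>;

// Chunk indices touched by both sessions; a delete counts just like a write.
fn conflicting_chunks(ours: &HashSet<ChunkIndex>, theirs: &HashSet<ChunkIndex>) -> Vec<ChunkIndex> {
    ours.intersection(theirs).cloned().collect()
}

fn main() {
    let ours: HashSet<ChunkIndex> = HashSet::from([vec![0, 0]]); // we deleted chunk (0, 0)
    let theirs: HashSet<ChunkIndex> = HashSet::from([vec![0, 0]]); // the other session did too
    let expected: Vec<ChunkIndex> = vec![vec![0, 0]];
    // Flagged as a conflict even though both sides did the same thing.
    assert_eq!(conflicting_chunks(&ours, &theirs), expected);
}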

@paraseba (Contributor Author):

Quoting the review comment above: "I'm curious whether it would make sense to just serialize the full changeset as the transaction log."

It's definitely a valid point in the design space. This was my reasoning for choosing the current style:

  • Most of the time you don't need or want to recover the session; it would make things slower to save you once every 10k times.
  • When you do, you want protection against machine failure during the whole session, not just during the brief time the commit lasts. If your machine dies during the 8 hours your writes take, you lose everything.
  • For that, we would need real session checkpoints, and those really look like storing the change set. But they need to be cleaned up, we need to decide on a checkpoint frequency, etc. So it felt like a very interesting feature to add later. Also, it's a feature that sounds very connected to "change sets that don't fit in memory", which is also something we should implement eventually.
  • The transaction logs have less information than the change sets, but in a more useful form, which simplifies the rebase (and hopefully the diff) functionality. For example, in the tx log I'm storing the node ids of all deleted nodes, as opposed to the change set, where we have only the paths.

I wish I had written a design doc...
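A simplified sketch of the contrast drawn in the last bullet, with illustrative field names loosely modeled on the struct fragments quoted in this review (not the actual on-disk format): the change set tracks deletions by path and carries the new data itself, while the transaction log records the stable NodeId of every touched node, which is what rebase wants to compare.

use std::collections::HashSet;

// Illustrative stand-ins; the real icechunk types carry much more information.
type NodeId = u128;
type Path = String;

// The in-memory change set knows what the session did, keyed mostly by path,
// and holds the actual new metadata, attributes and chunk payloads.
struct ChangeSetSketch {
    deleted_arrays: HashSet<Path>,
    // ...
}

// The transaction log keeps only what conflict detection needs, keyed by NodeId,
// so comparing two commits doesn't depend on how paths were reused or renamed.
struct TransactionLogSketch {
    deleted_arrays: HashSet<NodeId>,
    updated_user_attributes: HashSet<NodeId>,
    updated_zarr_metadata: HashSet<NodeId>,
}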

@mpiannucci (Contributor) left a comment:

This is impressive, and the tests are really easy to follow.

pub struct TransactionLog {
// FIXME: better, more stable on-disk format
pub icechunk_transaction_log_format_version: IcechunkFormatVersion,
pub new_groups: HashSet<NodeId>,
@mpiannucci (Contributor), Nov 23, 2024:

So, if I understand correctly, we keep a NodeId for the changes, which the Repository can then (in the future) resolve into a more readable or custom format by resolving the zarr path from the ID in the given manifest, right? Does the NodeId refer to the node's id in the incoming changeset or in the existing changeset?

Contributor Author (paraseba):

Yes, that is correct. NodeIds are slightly better than paths, because they are random, so something like create /foo -> delete /foo -> create /foo doesn't get tricky.

This is the NodeId assigned in the change that generated this transaction, so in your description, the existing changeset.
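A small sketch of why per-creation ids avoid the create /foo -> delete /foo -> create /foo ambiguity (hypothetical id generation, a counter instead of icechunk's random NodeIds): the two arrays that ever lived at /foo get different ids, so changes recorded against the first cannot be confused with changes to the second.

use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for a random NodeId: here just a fresh counter value per created node.
static NEXT_ID: AtomicU64 = AtomicU64::new(0);

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct NodeId(u64);

fn new_node_id() -> NodeId {
    NodeId(NEXT_ID.fetch_add(1, Ordering::Relaxed))
}

fn main() {
    // create /foo
    let first_foo = new_node_id();
    // delete /foo, then create /foo again: same path...
    let second_foo = new_node_id();
    // ...but different ids, so logs keyed by NodeId keep the two apart.
    assert_ne!(first_foo, second_foo);
}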

@paraseba merged commit 459296e into main on Nov 24, 2024
3 checks passed
@paraseba deleted the push-kmsppxuloqls branch on November 24, 2024 at 02:35